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ABSTRACT 

C/s-regulatory modules (CRMs) and motifs play a 
central role in tissue and condition-specific gene ex- 
pression. Here we present Imogene, an ensemble 
of statistical tools that we have developed to facil- 
itate their identification and implemented in a pub- 
licly available software. Starting from a small train- 
ing set of mammalian or fly CRMs that drive sim- 
ilar gene expression profiles, Imogene determines 
de novo c/s-regulatory motifs that underlie this co- 
expression. It can then predict on a genome-wide 
scale other CRMs with a regulatory potential simi- 
lar to the training set. Imogene bypasses the need 
of large datasets for statistical analyses by making 
central use of the information provided by the se- 
quenced genomes of multiple species, based on the 
developed statistical tools and explicit models for 
transcription factor binding site evolution. We test 
Imogene on characterized tissue-specific mouse de- 
velopmental CRMs. Its ability to identify CRMs with 
the same specificity based on its de novo created 
motifs is comparable to that of previously evaluated 
'motif-blind' methods. We further show, both in flies 
and in mammals, that Imogene de novo generated 
motifs are sufficient to discriminate CRMs related 
to different developmental programs. Notably, purely 
relying on sequence data, Imogene performs as well 
in this discrimination task as a previously reported 
learning algorithm based on Chromatin Immunopre- 
cipitation (ChIP) data for multiple transcription fac- 
tors at multiple developmental stages. 



INTRODUCTION 

The identification and functional characterization of the 
non-coding sequences that direct the spatio-temporal speci- 
ficity of gene expression in eukaryotes is of fundamen- 
tal importance in developmental biology (1) and can find 
crucial applications in medicine (2). These regulatory se- 
quences are generally located distally from gene promot- 
ers and termed enhancers or more generically c/^-regulatory 
modules (CRMs) since they can either enhance or repress 
gene expression (3). They usually are of the order of 500 nu- 
cleotides (nts) long and can be located as far as several mega 
base-pairs away from the transcription start sites (TSSs) of 
the genes that they regulate. CRMs are composed of tran- 
scription factor binding sites (TFBSs) that bring spatio- 
temporal specificity to the expression of their target pro- 
moters (4). Detailed studies in both flies and vertebrates 
(5) have shown that CRMs contain multiple binding sites 
for transcription factors (TFs) that can be either identical 
(homotypic clustering) or different (heterotypic clustering). 
Homotypic clustering can provide cooperative TF binding 
and sharp on-off gene expression whereas heterotypic clus- 
tering allows for combinatorial gene regulation. The extent 
to which the order and relative positioning of the different 
TFBSs in CRMs matter remains however debated (6,7). 

With the advent of ChlP-seq techniques, genome-wide 
studies are providing large amount of data on the binding 
loci of tissue-specific TFs (8), as well as on other factors that 
regulate transcription, e.g. by modifying chromatin struc- 
ture (p300, CTCF, histone marks, etc.) (9,10). These pro- 
tein binding data have helped the identification of numer- 
ous CRMs specific to well-defined developmental processes 
and it has brought important information on CRM struc- 
ture. However, genome-wide studies suffer from limitations. 
A full characterization of regulatory mechanisms would re- 
quire ChlP-seq analysis to be performed for every potential 
regulatory factor, on every tissue, at multiple developmental 
stages. The results would also have to be obtained for the of- 
ten heterogeneous cells that constitute the tissue of interest 
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instead of being averaged over them as it usually needs to 
be the case. Finally, and very importantly, binding cannot 
be equated to functional regulation. 

Therefore, in silico identification of CRMs forms a useful 
complement to genome-wide binding studies. Classic case- 
by-case studies or large-scale binding data (11), as previ- 
ously described, often provide a moderate number (about 
ten to a few tens) of CRMs, active in the co-regulation of a 
subset of genes, in specific biological systems or in the for- 
mation of different organs at various stages of development. 
Identifying the important binding sites on these known se- 
quences would help to bypass some of the limitations of 
large-scale studies by providing information on the factor 
involved, both known and new, as well as on the existence 
of a regulatory grammar (12). It should also help one to 
determine other CRMs providing specific expression pat- 
terns, a difficult task at present given the absence of close 
association (13) between CRMs and their target genes in 
higher eukaryotes. These labor-intensive experimental tasks 
could be eased by computational work. To this end, we have 
previously developed (14) statistical tools to determine cis- 
regulatory elements de novo in a set of input DNA sequences 
encoding a common transcriptional regulation. They allow 
the determination of regulatory elements from input DNA 
sequences without any prior information on the TFs acting 
in cis or on their binding sites. They make central use of the 
phylogenetic information contained in the aligned DNA se- 
quences of related species. The method was applied to the 
Drosophila melanogaster gene expression program in sen- 
sory organ precursor cells (SOPs), a specific type of neu- 
ral progenitor cells (14). Predicted motifs included already 
characterized TFBS as well as new motifs and were suc- 
cessfully tested by mutational analysis. These motifs were 
used to rank intergenic DNA fragments genome-wide for 
their regulatory potential in SOPs. Of the top 29 predicted 
CRMs, 38% were found by transgenic assays to direct tran- 
scription in SOP. A larger fraction (65%) drove more gener- 
ally transcription in neural precursors. 

This successful application to a Drosophila transcrip- 
tional program led us to try and extend the method devel- 
oped in (14) to the case of mammalian CRMs. The task 
of determining c/^-regulatory elements is even more dif- 
ficult for mammalian genomes than for Drosophila ones 
since they are an order of magnitude richer in intergenic 
sequences (15,16). To tackle this challenge, we have devel- 
oped Imogene, a computer algorithm and software that we 
present here and characterize. Imogene predicts: 

(1) c/^-regulatory sequences (of about 10 nt long) within a 
moderate set size of 10-30 CRMs, responsible for spe- 
cific gene co-regulation, as well as a set of probability 
weight matrices (PWMs) or motifs (17,18) characteriz- 
ing the DNA-binding specificity of the associated puta- 
tive factors. 

(2) novel CRMs at the genomic scale with the same expres- 
sion pattern as the starting set of CRMs, based on the 
set ofbuihPWMs. 

Numerous algorithms have already been developed to try 
and map c/^-underlying transcriptional regulation (see, e.g. 
(3,17,19-21) for recent reviews). Imogene differs from previ- 



ous methods in several respects. Imogene aim is most similar 
to the goal of the 'motif-blind' algorithms analyzed in (22). 
These algorithms have been specially designed to character- 
ize the specificity of a small set of CRMs, contrary to other 
algorithms that are aimed at the analysis of large datasets 
such as whole ChlP-seq peak regions (23). As Imogene, they 
work de novo instead of using already characterized bind- 
ing motifs (24-32). Faced to the weak statistical discrim- 
inative power offered by the starting set of characterized 
CRMs, the algorithms of (22) try and distinguish regula- 
tory sequences by their entire content in short nucleotide 
sequences as also proposed in other works (33-37). On the 
contrary, Imogene insists on building c/^-regulatory motifs 
since those are important for experimental work. It instead 
relies on conservation and the comparison of multiple se- 
quenced genomes. 

In the following, the general methodology of Imogene is 
first presented. Then, Imogene performance on mammalian 
CRMs is assessed. Imogene is trained on CRMs pertain- 
ing to neural tube and limb developmental programs dur- 
ing embryogenesis. It is shown to successfully classify other 
CRMs in the same class based on its de novo created hst 
of best motifs that contained both new and already known 
motifs. Imogene is furthermore found to perform compa- 
rably to 'motif-blind' algorithms using the benchmark and 
methodology of (22). We then consider the distinct but re- 
lated task of discriminating CRMs with different specifici- 
ties, rather than discriminating a set of specific CRMs from 
background intergenic sequences. Imogene is shown to ac- 
curately discriminate mammalian neural tube from limb 
CRMs on the basis of very few learned motifs. To further 
assess the performance of Imogene, it is applied to the dis- 
crimination of five sets of mesodermal fly CRMs, a task 
previously considered in (38). Remarkably, the CRM clas- 
sification solely based on Imogene de novo generated motifs 
is found to be of similar quality as the results obtained in 
(38) based on ChIP binding data for multiple TFs at several 
developmental time points. Finally, the developed publicly 
available Imogene interface is presented. 

MATERIALS AND METHODS 

Genome alignments 

The alignments were downloaded from ftp://ftp. 
ensembl.org/pub/release-63/emf/ensembl-compara/ 
epo_12_eutherian for mammals and from http: 
//www.biostat.wisc.edu/ cdewey/fly_CAFl/data for 
Drosophilae. For the latter case, we have used the align- 
ments engineered by A. Caspi with the help of the Mercator 
and MAVID programs. In both cases, the alignments were 
processed through a customized script to produce align- 
ments in FASTA format, mask for coding sequences (CDS) 
and simple repeats (see below). These scripts are available 
in the Imogene distribution. 

Annotations 

The CDS coordinates were downloaded from 
ftp://ftp.ensembl.org/pub/release-64/gtf/mus_musculus 
for mammals (mm9 coordinates) and from ftp: 
//ftp.flybase.net/releases/FB2011_06/dmel_r5.38/gff for 
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Drosophilae (release 5 coordinates). In the case of mammals, 
the TSS coordinates were obtained separately from http: 
//hgdownload . cse. ucsc . edu/goldenPath/mm9/ database . 
Mammalian alignments were already masked for repeat 
sequences. Drosophilae aUgnments were masked using the 
coordinates indicated in the gff file. 



Phylogenetic trees 

The phylogenetic trees used within Imogene are displayed in 
Supplementary Figure SI. For Drosophilae, the distances 
are taken from Heger and Pouting (39). For mammals, they 
are obtained from the Ensembl (15) website (www.ensembl. 
org). 



Background sequences 

Imogene computes the statistical over-representation of the 
predicted motifs by comparing them to 20 Mb of back- 
ground intergenic DNA (10"^ regions of 2 kb). The script 
that generates the random coordinates is included in the dis- 
tribution of Imogene as well as the actual coordinates of the 
produced intergenic regions. 



Training sets 

The two used mammalian training sets (limb and neu- 
ral tube) were obtained from http://enhancer.lbl.gov based 
on the work of Visel et al (11). They were manually cu- 
rated to produce a high-quality dataset, with respectively 
41 CRMs for the limb and 33 for the neural tube. We fur- 
ther pruned out uninformative CRMs for which no mo- 
tifs could be generated, either because of repeat masking 
or because of lack of conservation. More precisely, the ref- 
erence species sequence was scanned using a window size 
corresponding to the motif size. If a sequence did not con- 
tain any masked nucleotide, we looked in the other species 
for any unmasked sequence in the surrounding neighbor- 
hood of ±20 nt, our flexibility criterion when defining a 
conserved instance. If putative orthologous sequences were 
found in enough species to satisfy our conservation require- 
ments (see below), the site was declared as a putative con- 
served site for a regulatory motif This filtering step resulted 
in final sets of 39 limb CRMs (minimal length 789 bp, max- 
imal length 9052 bp and average length 3045 bp) and 29 
neural tube CRMs (minimal length 585 bp, maximal length 
3045 bp and average length 2419 bp). 

The Drosophilae training sets were obtained from (38). 
Coordinate files are given as Supplementary Material. 



Main program 

The main program is written in C++ and adapted from 
the program used in a previous study (14). It is distributed 
under the GNU GPL license and available as a git reposi- 
tory at http://github.com/hrouault/Imogene. The user man- 
ual is available at http://hrouault.github.io/Imogene/. The 
program can be accessed through a web interface at http: 
//mobyle.pasteur.fr/cgi-bin/portal.py#forms::imogene. 



Binding site scores 

A given motif is represented by a PWM with frequency Wi^ b 
for the base b at position /. The index / runs from Holm, the 
size of the motif, which is a parameter in the program that 
takes the same value for all considered motifs. The binding 
score of a sequence Si for such a motif is defined through the 
corresponding PWM as: 

where tt^ is the mean frequency of the base b within inter- 
genic regions (7rA,T = 0.30 and ttcg = 0.20) as measured 
on the 'background sequences' (see the 'Background se- 
quences' subsection for their detailed description). A se- 
quence is considered as a binding site in the reference species 
{D. melanogaster or Mus musculus) when its score S is larger 
than the score threshold {Ss or Sg) defined by the user of 
Imogene. 

Conservation requirements for binding sites 

Imogene iteratively builds PWM from binding sites that 
have conserved instances in different species. The conserva- 
tion requirement is that orthologous instances are found in 
at least three distant species, including the reference species. 
For mammals, the six groups of related species are com- 
posed of: M. musculus and Rattus norvegicus; Callithrix jac- 
chus, Macaca mulatta, Pongo abelii, Gorilla gorilla, Homo 
sapiens and Pan troglodytes; Bos taurus; Sus scrofa; Canis 
familiaris; Equus caballus. Similarly for flies, there are five 
groups composed of: D. melanogaster, D. sechellia, D. sim- 
ulans, D. yakuba and D. erecta; D. ananassae; D. pseudoob- 
scura and D. persimilis; D. willistoni; D. grimshawi, D. mo- 
javensis and D. virilis. 

A site instance must be found in at least three of these 
groups (with an allowed shift of up to 20 nt with the refer- 
ence species) to be considered conserved by Imogene. 

Evolutionary models 

Imogene can use two different evolutionary models, which 
vary in complexity and computational time, to compare or- 
thologous binding sites. In both models, the bases within a 
site evolve independently of each other. 

Felsenstein model. The simplest models of TFBS nu- 
cleotide evolution are copied on models of neutral evolution 
for genomic nucleotides. This procedure has been proposed 
by Sinha et al. (29,41) with the Felsenstein model of neutral 
evolution (42). In this TFBS evolution model, the transition 
probability from nucleotide b to b^ at position / in two sites 
at evolutionary distance d is defined as 

Pb^b' = ^ h,b' + (1 - ^) ^Ub'^ (2) 

where S^^b' is the Kronecker symbol, Wi^b' is the mean fre- 
quency of base b' at position / of the site (as given by the 
PWM model), and q is the probability of conservation for 
an evolutionary distance d under neutral selection (see be- 
low). 



Nucleic Acids Research, 2014, Vol 42, No. 10 6131 



When two species are close to one another, \ and the 
probabihty that the observed bases are identical is high. On 
the contrary, when the two considered species are distant 

~ 0), the observed bases are uncorrelated and reflect the 
PWM probabiHties Wi^b- 

The probability of conservation q can then be computed 
within this model by setting the PWM probabilities b to 
the mean genomic frequencies tt^: 



exp 



l/2 + 47rA,T:^c,G 



(3) 



with TtAj (resp. ttcg) the common genomic frequency of A 
and T (resp. C and G). 

Halpern-Bruno model The Halpern-Bruno model (HB) 
(43) differs in two ways from the simplest Felsenstein model. 
It uses the more complex Hasegawa, Kishino and Yano 
model (HKY) (44) for the neutral evolution of nucleotides 
and adds a fixation probability based on fitness differences 
for the evolution of nucleotides within the TFBS. 

The HKY model improves on the Felsenstein model by 
taking into account the observed dependence of the mu- 
tation rate on the chemical nature of the bases. Substitu- 
tions between bases of the same chemical nature (purine or 
pyrimidine), also called transitions, are generally more fre- 
quent than the other type of mutations called transversions. 
This is encapsulated in the HKY model by the parameter k 
which is the ratio of the transition rate over the transver- 
sion rate. It is measured to be /c = 2 in flies and /c = 3.7 in 
mammals (45). 

Within a TFBS, the HB model extends the HKY model 
to take into account an additional purifying selection on the 
nucleotide identities (43). It is formulated by the following 
transition probabilities: 

Pb^b' = exp(rH)^,^s (4) 

where H is the rate matrix defined by 



Hb 



(5) 



The evolutionary time t is expressed in terms of the evolu- 
tionary distance by 

d 

(6) 



t = 



1/2 + 4a: 7rA,T7rc,G 
Finally, the transition rates are defined by 
log ( ^I^) 

hb^b' = ; — otb^b' (n 

with ab^b' = a: for a transition and ab^b' = 1 for a transver- 
sion. 

Inference 

The algorithm infers in a Bayesian way the PWM w fre- 
quencies b based on observations of binding sites, as pre- 
viously described in (14). In a Bayesian framework, the pos- 
terior distribution V(w\{A]) that the matrix w represents 
the PWM binding to a set of aligned nucleotides {A} is pro- 
portional to the product of 



the a priori probability Vap(w), the 'prior', that the matrix 
w represents a PWM, 

the probability 7^(M} | w) of observing the set of aligned nu- 
cleotides given that they belong to binding sites for the 
PWM w;. 

The prior is taken to be a Dirichlet distribution with pa- 
rameters ap at each PWM position. 



be{A,T,C,G} 



at-l 
i,b ' 



(8) 



The nucleotides at different positions are assumed to be in- 
dependent and the prior for the full site is taken to be the 
product of the 7^ap(w^/) over the different positions. The pa- 
rameters are taken to be equal for Watson-Crick comple- 
mentary nucleotides since a sequence and its reverse com- 
plement are not distinguished in the description of bind- 
ing sites (i.e. we assume that binding is not biased toward 
a particular DNA strand). The two values of are fully 
determined by assuming that (i) TFBS a priori have the 
same nucleotide frequencies as the background and (ii) that 
a PWM mean a priori information content is equal to the 
input threshold score Sg. 

The probability V({A]\w) of observing the set of ahgned 
nucleotides given the PWM w is computed in a standard 
way (42) by recursion for a given PWM w and a given evo- 
lutionary model. 

The posterior distribution of the nucleotide frequencies 
at position / is thus obtained under the form 



nwi\{A]) oc Yl V(a\wi,b) n 



ab-l 



i,b 



(9) 



ae{A} 



be{A,T,C,G} 



where we omit the normalization factor. 

In the idealistic case where the aligned nucleotides rep- 
resent independent observations (infinitely distant species), 
the likelihood reduces to a multinomial distribution and the 
posterior is given by 



V(wA{A]) oc n u;, 



Nb+ab-l 
i,b 



(10) 



be{A,T,C,G} 



where Nb is the number of times the base b is observed in 
{A}. This formula allows simple analytic formulations for 
the estimator of mean and maximum posterior probability. 
The mean posterior estimate is expressed as 



yji,b = 



_ Nb-\-ab 
Hb Nb+otb' 



(11) 



Equation (11) coincides with the maximum likelihood es- 
timate for a Dirichlet 'prior' with parameters + 1 . 

In the case of a non-trivial evolutionary tree (like those of 
Supplementary Figure SI, the orthologous sites are corre- 
lated by their evolution from common ancestors. The prob- 
ability V{a I Wi^b) is a polynomial function of the ^'s. How- 
ever, it generally lacks a simple analytical expression and the 
mean posterior estimate should be computed numerically. 

Mean posterior estimation 

The mean posterior estimate was initially computed using 
a Markov chain Monte Carlo procedure (46). This turned 
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out to be a time-consuming step in the algorithm. To speed 
it up, we observed, as noted above, that the mean posterior 
estimate for a prior with Dirichlet parameters coincided 
with the maximum hkehhood estimate for a prior Dirich- 
let parameter + 1 in the case of uncorrelated observa- 
tions as well as fully correlated ones (i.e. reducing to a sin- 
gle observation). We thus reasoned that maximization with 
this modified Dirichlet prior could give a quick satisfying 
approximation for the phylogenetic trees of Supplementary 
Figure SI, which was checked on different examples. This 
procedure is thus adopted in the present version of Imo- 
gene and for the results shown here. The posterior distribu- 
tion obtained with the modified prior is maximized by us- 
ing the Nelder-Mead simplex algorithm, as implemented in 
the GNU GSL. The initial value for the estimation is taken 
to be the mean estimator in the independent species regime 
given in Equation (10). This allows one to start close to the 
quadratic region and ensures fast convergence. 

A simple example of nucleotide inference using the two evo- 
lutionary models 

To illustrate the inference of ancestral nucleotides and the 
main features of the two models, we consider in Supplemen- 
tary Figure S2 a dinucleotidic genome with bases X and Y 
and a simple phylogenetic tree with an ancestral species at 
equal evolutionary distance from the reference species and a 
daughter species. We suppose that the observed nucleotide 
at position / of an observed binding site is X, both in the 
reference and the orthologous species. 

Our goal is to infer the frequencies w y and wx = ^ — 
wy. First, there are two simple cases. For d = 0, the ob- 
servations of the same nucleotide in the two evolutionary 
branches really constitute only one observation of X. On the 
contrary, for very long evolutionary branches d ^ oo, the 
two instances of nucleotide X form two independent obser- 
vations. Using the previous result (Equation (11)) with ax 
= ay = a, the estimator of the maximum transformed pos- 
terior distribution for Nx and Ny independent instances of 
Xand 7 is 



Ny 



Wy = 



Ny-\- Nx-\- 2a 
Thus, for d=0, the inferred frequency is 

a 

Wy ■■ 



1 +2af 

while for <i ^ oo, it tends toward 

a 

Wy ■■ 



2 + 2a 



(12) 



(13) 



(14) 



Between these two extreme cases, an evolutionary model 
has to be used to estimate wy for finite evolutionary 
branches of length d. 

For the Felsenstein model, the likelihood function writes 



V(A\w) = wx[q + (1 - q)wxf + Wy(l - qfw\ 
= q^Wx + {\ -q^Wx 



(15) 



where A stands for the simple alignment considered in Sup- 
plementary Figure S2 and we used wx = I — wy. From 



this expression it can clearly be seen that the evolutionary 
model simply interpolates between the independent species 
case (d ^ oo, q = 0), where there are two observations of 
base X: V(w\A) = w\, and the fully correlated case (d = 
0,q= 1) where the two species merge and we have only one 
observation: V(w \A) = wx- The corresponding mean w r,me 
and maximum posterior w y^^na analytic estimates for finite 
d read 



a l-\-q^ 



2 a + 1 + aq^ 
1 



4(a + 1)(1 - q^) 



3a + 2 - (a + l)q^ 



[a + 2 - 3(a + 1)^2]2 + 8^2(1 _ ^2)(^ + 1)2 



Note that for the maximum posterior estimate, wy^^nsi, the 
prior exponent a + 1 has been used instead of a as explained 
above. So, the two estimates coincide at ^ = 0 and q = I. 
Both estimates are plotted as function of the evolutionary 
distance d in Supplementary Figure S2 (a = 0.1). 

For the HB model, the analogous results have been com- 
puted numerically and are also shown for comparison in 
Supplementary Figure S2. The HB model results are seen 
to be closer to the large distance limit than the Felsenstein 
model ones. Moreover, the difference between the nature of 
the estimates is seen to be comparable to the difference be- 
tween the evolutionary models. 



Filtering of motifs coming from simple repeats 

Imogene pre-processes the training set by masking repeated 
sequences with repeat masker (47) but this is not sufficient 
to eliminate the production of motifs corresponding to re- 
peated sequences. These motifs have a non-Poissonian dis- 
tribution of binding sites on intergenic sequences: one bind- 
ing site has a high probabihty to be followed by another 
one after a multiple of the repeat period. This anomalous 
distribution of binding sites biases motif ranking and di- 
minishes the algorithm CRM predicting power (14). Mo- 
tifs corresponding to repeated sequences are thus filtered 
out using the non-Poissonian characteristics of their bind- 
ing site distribution. The binding sites of each motif m are 
determined on the above-described set of TVbg = 10^ back- 
ground sequences of length L = 2 x 10^ nt. For a Poisson 
distribution, one would expect the number N^\j) of inter- 
genic sequences containing J binding sites to be 



(16) 



where A^^^ is the computed density of binding sites of the 
motif m in the set of background sequences. The deviation 
from this theoretical Poisson distribution is quantitatively 
assessed by computing the x^-like value. 



X^(m) 



E 



@{NJj)), (17) 
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where 8 is the Heaviside function = Oforx < 0, 8(x) 
= 1 for X > 0) that restricts the sum to non-zero values of 
Nmij)- Only the 75% motifs with the lowest x^(^)-value are 
retained for subsequent computations. 



Distance between motifs 

The similarity between two motifs is quantitatively assessed 
based on the overlap between the sets of their binding sites. 
The 'strict proximity' between motifs represented by two 
PWMs wi and W2 is defined by 

Prox(w^^\ w^2)) ^ 

^ Prob {[^(s, w(i)) > ^h] and [^(s, w^^)) > ^h]} 
Prob{^(s, w(i)) > ^h} + Prob{^(s, w(2)) > ^h} 

where Prob{iS(s, w) > ^h} is the probability that a sequence 
s drawn at random with the background frequencies ir^ has 
a binding score ^^(s, w) (Equation (1)) above the threshold 
iSth for the frequency matrix w. The strict proximity is com- 
puted analytically as explained in (14), where it was defined. 
To take into account potential shifts in the motifs or in 
their orientation, Prox(w^^\ w^^^) is computed for all possi- 
ble alignments of the two matrices (with a maximum shift of 
lm/2 where is motif size) in the two possible orientations. 
When shifted matrices are compared, they are completed by 
additional columns with the background frequencies (i.e. 
with no specificity). The proximity between two motifs is 
obtained simply by taking the maximum over the obtained 
strict proximities. It goes from 1 for two identical motifs to 
0 for motifs that do not share any binding site above the 
threshold. Imogene distance between two motifs is defined 
as minus the logarithm of their proximity. 



Ranking motifs 

The previous filtering step provides for each considered mo- 
tif m the density k^^^ of its binding sites on the background 
sequences and ensures that these sites are approximately dis- 
tributed in a Poissonian way. The deviation from this base- 
line distribution on the CRM of the training set (t.s.) is used 
to score each motif This is quantified by the Poissonian log- 
likelihood of the training set 



Pl(m) 



= E 

te{t.s.} 



log 



(LAL'^^)'^exp(- 



V 



(19) 



where kt is the number of instances of m on the train- 
ing set sequence t of length L^. Larger deviations from the 
baseline Poissonian distribution are supposed to reflect mo- 
tif specificity for the training set and correspond to more 
negative/better scores. 



Scoring intergenic sequences 

Given a Hst of motifs m^, a CRM E is scored as follows: 

5(£) = ^«(£,m01og(AjAf), (20) 



where n(E, rrii) is the number of binding sites for the motif 
rrii on E and Xj, X\ are the average number of binding sites 
per base on the training set and background respectively. It 
is important to note that the previously found motif binding 
sites are masked when scanning with successive motifs. Thus 
motifs with lower ranks that resemble high-ranking motifs 
do not increase artificially the CRM weight by predicting 
the same binding sequences twice. 

Selection of optimal intergenic sequences 

When ranking genome-wide intergenic sequences, with a list 
of motifs, the best intergenic sequence at a given position 
is determined as follows. The list of motifs is used to scan the 
genome for conserved binding sites above a given threshold. 
Binding sites are then grouped in successive CRMs of size 
L such as to maximize clustering. The position Ei of the 
center of the enhancer / is chosen to be the center of the 
motifs cluster: 



Ei 



X\ + + — 1 



(21) 



where X\ and are the starting positions of the first and 
last TFBSs in the cluster and L is the width of the motif 



Mammalian predictions 

Learning sets, test sets and background test sets. For each 
class, the CRMs were divided into a learning set composed 
of 15 CRMs chosen at random, the other CRMs (-^20) 
defining the test set of True Positives'. In addition, a set of 
background test regions was built using the 1 kb flanking 
sequences of the full list of CRMs. 

Such an 'adapted' background test set was used to pro- 
vide a more stringent and informative test of the algorithm. 
It prevents discrimination on the training set from the back- 
ground test set based on other features than the sought 
high-information-content motifs such as a local composi- 
tion bias. Furthermore, in order to avoid biasing the results 
toward the True Positives, uninformative sequences for Imo- 
gene (i.e. sequences where no binding site could possibly be 
found given Imogene conservation requirements) were also 
removed from this background test set. This yielded back- 
ground test sets of 72 CRMs for the limb and 57 for the 
neural tube. 

Cross-validation protocol. The learning set was used to 
learn the motif content. The 10 best motifs were then used 
to score test set CRMs and background regions. Because 
the length of the training set CRMs could vary, we decided 
to keep for each test sequence the best scoring 1 kb frag- 
ment. This process was repeated 40 times, and both genera- 
tion and scanning threshold were varied. The retrieval rate 
of test set CRMs (True Positives) among background ele- 
ments (False Positives) as a function of the score was used to 
build a Receiver Operating Characteristic (ROC) curve. The 
Area Under ROC Curve or AUC, a quantity that varies be- 
tween 0 for absolute misclassification, 0.5 for random clas- 
sification, and 1 for perfect classification, was used to evalu- 
ate the quality of prediction. The parameter set yielding the 
highest AUC was chosen as the best set. 
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Comparison with Kantorovitz et aL 

The data provided by Kantorovitz et al (22) consisted of 
eight classes of human CRMs, each class containing be- 
tween 10 and 67 CRMs, with an average of 30 CRMs per 
class. Coordinates of the human CRMs were obtained from 
http://enhancer.lbl.gov and were converted from the hu- 
man hgl9 to the mouse mm9 assembly using the UCSC 
LiftOver tool, yielding a loss of 5-10 unmapped CRMs per 
class. The extraction of the mammalian alignments with 
Imogene resulted in another loss of 1-2 CRMs per class 
for which no alignment could be retrieved. Overall, the ra- 
tio of finally retrieved over initially available number of 
CRMs per class was the following: dorsal root ganghon 
(7/10), eye (12/16), forebrain (50/67), heart (9/19), hind- 
brain rhombencephalon (25/32), limb (24/35), midbrain 
mesencephalon (36/42) and neural tube (20/23). The sen- 
sitivity of Imogene on these CRMs was then computed us- 
ing a leave-one-out cross-validation scheme as described in 
(22). More precisely, we measured the ability of Imogene to 
retrieve a bona fide CRM of a given class (the test CRM) 
embedded in 10 kb of background intergenic DNA using 
the best motifs generated from the other CRMs of the class 
(the training set). The sequence containing the test CRM 
and the intergenic DNA was scanned using a window size 
equal to the average length of the CRMs in the given class. 
The window of highest score defined the predicted CRM. 
The prediction was considered valid when the test CRM 
and the predicted CRM overlapped over at least 100 bp. 
The sensitivity was finally computed as the proportion of 
test CRMs retrieved. This process was repeated using 10 dif- 
ferent background regions. Imogene parameters were var- 
ied (evolutionary models: F and HB, Sg'. 11 and 13, Ss'. 5- 
13, number of motifs: 1-15), and the parameters giving the 
highest mean sensitivity over the 10 cross-validations were 
kept for each class. A />-value of 0.05 was computed empir- 
ically as described in (22). More precisely, the above pro- 
cess was repeated by drawing at random the best scoring 
fragment. The proportion of correctly predicted CRMs was 
computed. The process was repeated 100 000 times. The 
threshold sensitivity expected by chance with a />-value of 
0.05 was obtained as the threshold proportion above which 
the 5000 higher computed proportions lied. 

Two separate sets of tests were performed, either with or 
without making use of multiple sequenced genomes and 
conservation at the CRM retrieval step. Without conser- 
vation, tests were performed on human CRMs embed- 
ded in human intergenic background using the exact same 
composite sequences as in (22). For Scangen in its nor- 
mal mode with conservation, the mouse is the reference 
species and alignments for the background sequences used 
in (22) needed to be retrieved. To do so, we followed the 
same protocol as for the training set sequences using the 
UCSC LiftOver tool for coordinate conversion and extract- 
ing alignments with Imogene. Because the length of the fi- 
nal background sequences could change during the pro- 
cess, we redefined 10 kb background regions around the 
centers of the sequences in the mouse genome. The train- 
ing set alignments were then embedded in the center of 
the corresponding 10 kb background alignments and re- 
peats were masked using repeatmasker (47). The resulting 



sequences were finally used to conduct the leave-one-out 
cross-validation (LOOCV). We note that repeat masking 
did not significantly affect the results (data not shown). 

Leave-one-out cross-validation for the CRM discrimination 
task 

Let us note C/ the tissue class of interest. There are Mi corre- 
sponding CRMs. Let Nc denote the total number of classes. 
Our goal is to find the particular motif signature that dis- 
tinguishes these Mi CRMs from the TV^ — 1 other classes 
of CRMs. This signature corresponds in our case to a num- 
ber N of top ranked motifs with generation and scanning 
thresholds Sg and Ss. These are the three parameters we 
wish to constrain with a LOOCV procedure. 

Let us detail this procedure in the case where we distin- 
guish class Ci from the other classes Cj . The Mi CRMs of C/ 
are termed 'positive' CRMs and the Mj CRMs of each of 
the other classes are termed 'negative' CRMs. Let us note M 
= Y.i^i is the total number of CRMs. The LOOCV consists 
in withdrawing one 'test' CRM from these M CRMs, learn 
the motifs on the M — 1 resulting CRMs, and use them to 
score the let alone test CRM. For the learning step, motifs 
are generated with threshold Sg on each class (one class be- 
ing deprived of one CRM), yielding Nc sets of motifs: one 
set of positive motifs from class C/, and Nc — \ sets of nega- 
tive motifs from the other classes. The top ranked motifs 
from each set are then used to scan the M CRMs for con- 
served instances with scanning threshold Ss. Each CRM E 
is scored with respect to these A^^ sets of motifs by 

Nc 

S(E) = J2(^8jj - l)^(^), (22) 

7=1 

where ^(^) is the CRM score for the N top motifs of 
class Cj as defined below in the 'Main program' descrip- 
tion, and 8;; , = 1 if y = /, and 0 otherwise. This score sim- 
ply gives positive contributions if positive motifs are found 
on the CRM, and negative contributions if negative motifs 
are found. This scoring procedure allows to rank the test 
CRM among the other M — 1 CRMs. Ties are resolved by 
attributing their mean rank to equally scored CRMs. The 
rank of the test CRM is used rather than its raw score to 
avoid potential bias stemming from score normalization. 
Indeed, the raw score is dependent on the generated motifs, 
which differ at each step of the LOOCV. This procedure is 
repeated over all M CRMs, yielding a corresponding list of 
M ranks. This list is finally used to build a ROC curve dis- 
criminating True Positives (CRMs from class Ci) from False 
Positives (the other CRMs). The discrimination is quanti- 
fied by the AUC for a false positive rate FPR < 20%, which 
we notify as AUC20 and want to maximize. 

In our case, we used a 2D parameter grid with Sg varying 
between 7 and 13 bits by steps of 1, and Ss varying between 
Sg — 5 and Sg by steps of 1 . Both Felsenstein and HB models 
were used for motif generation. For each parameter set, the 
number of motifs used for scanning was increased from 1 
to a maximum number of 10 (actually never attained) until 
the addition of a new motif decreased the AUC20, yielding 
an optimal number of motifs N. Finally, for each class, the 
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parameter set {Sg, Ss, N} yielding the highest AUC20 was 
selected as the best parameter set. 



Motifs identification 

In order to identify the known TFs that might correspond 
to the de novo generated motifs, we used the TRANSFAC 
(48) and JASPAR (49) databases, as well as the hst of mo- 
tifs provided in (50) from HT-Selex experiments and the 
UniPROBE database (51) created from protein binding ar- 
ray data. 

In order to avoid uninformative matches, we kept mo- 
tifs that had an information content greater than 8 bits, a 
threshold approximately corresponding to four conserved 
nucleotides. This led us to keep 764 vertebrate motifs 
(934 total minus 170 below threshold) in the TRANSFAC 
database, 389 vertebrate motifs (476 total minus 87 be- 
low threshold) in the JASPAR database 575 HT-Selex mo- 
tifs (580 total minus 5 below threshold) and 488 in the 
UniPROBE database (538 total minus 50 below threshold). 

Each de novo motif was compared to all kept motifs in 
each database, using the PWM distance introduced in (14). 
During the comparison, motifs are shifted to find the best 
match with a minimal match of 5 nt. The shift is simply in- 
troduced by adding flanking nucleotides with background 
frequency on either side. The closest candidate was kept for 
identification. 
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Figure 1. Imogene workflow. The algorithm takes as input a hst of func- 
tionahy related CRMs. Homologous sequences from closely related species 
are automatically retrieved (I) and scanned in order to generate a list of pu- 
tative transcription factor motifs (II). These motifs fuel the last step con- 
sisting in the inference of related novel CRMs (III). These predicted CRMs 
can finally be compared to a set of test CRMs to evaluate the predictability 
power of the whole procedure (IV). 



RESULTS 

Description of Imogene 

Imogene has two modes that can be used in succession, as 
shown in Figure 1 and summarized here (see the 'Materials 
and Methods' section for details of their implementation). 

The first mode, Genmot, aims at extracting statistically 
meaningful PWMs from a 'training set' of functionally re- 
lated CRMs on a reference genome (the mouse M. muscu- 
lus genome for mammals; the D. melanogaster genome for 
flies). The cumulated size of the training set could in princi- 
ple be unlimited, but in practice computer execution time 
requires it to stay below 100 kb. It should also be above 
a few kb to provide a sufficient amount of information (a 
training set of about 20 kb appears as a good compromise). 
Starting from a chosen training set, Genmot performs its 
task in two steps (I and II in Figure 1): (I) Genmot first en- 
larges the training set with aligned orthologous sequences 
in other related sequenced genomes (see 'Genome align- 
ments' in the Materials and Methods section), as shown 
in Supplementary Figure SI (for the mouse, the 11 other 
aligned mammalian sequenced genomes with high cover- 
age presently available on the Ensembl project (15), the 11 
other Drosophilae sequenced genomes (16) for the fly). This 
comparative genomics step results in the creation of the 'en- 
larged training set' (step I in Figure 1). (II) In this second 
central step, Genmot builds PWMs of a given length £ (10 
nt is the default value) by scanning the training set in an 
iterative manner (step II in Figure 1). Each sequence of i 
nucleotides in the training set is used in turn to create an ini- 
tial PWM using a Bayesian prior. This PWM is then refined 
by scanning the training set to find all the PWM binding 



sites in the training set, i.e. all i nucleotide long sequences 
in the training set that have a binding score above a gener- 
ation threshold score Sg chosen at the procedure onset {Sg 
= 13 bits is the default value). These binding sites are fil- 
tered using conservation, i.e. only sites that have orthologs 
in distant species are further considered (see 'Conservation 
requirements for binding sites' in the Materials and Meth- 
ods section). A shift in alignment between a binding site on 
the reference species and its orthologs in other species is al- 
lowed for the correction of eventual alignment errors (20 nt 
is the shift default value). The ensemble of conserved bind- 
ing sites and their orthologs serve, using an evolutionary 
model, to build a refined PWM. The procedure is then it- 
erated by finding the binding sites of the refined PWM and 
using them to build a further refined PWM until conver- 
gence to a stable set of binding sites. 

The need of an evolutionary model to properly assem- 
ble binding sites (28,29,41) is simply explained. A binding 
site in the reference genome and its orthologs are all related 
through descent from their last common ancestor, and can- 
not therefore be considered as independent observations. In 
order to correctly quantify the amount of information pro- 
vided by the observation of orthologous sites, one has to 
estimate their potential of change through mutation since 
their last common ancestor. To account for this, Imogene 
can, in its present implementation, make use of either one 
of two evolutionary models of TFBS evolution at the user 
choice. The first option, 'Felsenstein model', is a simple and 
computationally fast model proposed in (41). Mutations are 
generated at the same rate in a PWM binding site than in 
the background intergenic sequences. However, the mutated 
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nucleotide in a binding site is drawn according to its fre- 
quency in the PWM at the mutated position. This is analo- 
gous to the simplest model of DNA evolution (42) but with 
nucleotide neutral relative abundances replaced by PWM 
nucleotide frequencies. The Felsenstein model is the sim- 
plest model that provides at evolutionary equilibrium nu- 
cleotide frequencies that agree with those prescribed by the 
PWM at the different positions in the binding site. The sec- 
ond option (43), 'HB model', uses an evolutionary model 
that is more complex than the Felsenstein model but is 
also more clearly grounded on theoretical population ge- 
netics ideas. It has previously been used for TFBS evolu- 
tion in (28). It allows for the inclusion of different muta- 
tional probabilities between different bases in the neutral 
background intergenic mutation model. Additionally, it in- 
cludes a fitness-dependent fixation probability for a muta- 
tion in a TFBS based on classical population genetics es- 
timates for the fixation of a mutant allele appearing in a 
homogeneous population (52). The relative fitnesses of dif- 
ferent nucleotides are determined by the requirement that 
binding site convergence to evolutionary equilibrium leads 
to the PWM nucleotide frequencies (see the Materials and 
Methods section for details). 

The described procedure produces a PWM for each I nu- 
cleotide long sequences in the training set. In a series of final 
steps (see the Materials and Methods section for a mathe- 
matically detailed description), this long list is pruned and 
ranked based on comparison of the PWM bindings sites 
on the training set to a 'background' set of intergenic se- 
quences in the reference genome (20 Mb of M Musculus 
or D. melanogaster genomic DNA). Imogene pre-processes 
the training set by masking repeated sequences with repeat 
masker (47) but, as noted in (14), this is not sufficient to 
eliminate some PWMs corresponding to repeated sequences 
from the produced list of PWMs. These PWMs have sta- 
tistically anomalous distributions of binding sites that bias 
their subsequent ranking. Therefore, in a filtering first step, 
PWMs corresponding to repeated sequences are discarded 
on the basis of their anomalous distribution of their binding 
sites in the background set (see 'Filtering of motifs coming 
from simple repeats' in the Materials and Methods section). 
Then for each remaining PWM, the distribution of its con- 
served binding sequences on the training set is compared to 
the distribution of the PWM conserved binding sequences 
on the set of background intergenic sequences. The larger 
the statistical deviation between the two distributions, the 
larger the PWM score and the more meaningful the PWM is 
deemed (see 'Ranking motifs' in the Materials and Methods 
section). In a final step, PWMs in the ranked list are com- 
pared (see 'Distance between motifs' in the Materials and 
Methods section) and, among similar ones, only the high- 
est scoring one is kept. Although the identity of the TFs 
corresponding to the different PWMs of interest is not di- 
rectly assessable by the algorithm, the comparison between 
the produced PWMs and existing databases can provide rel- 
evant information on their identity, as will be shown in the 
following sections. 

In its second mode, Scangen, Imogene determines inter- 
genic sequences in the reference genome that are consid- 
ered as putative CRMs with the same functional specificity 
as the training set. This second mode (step III in Figure 1) 



is based on the PWMs inferred in the Genmot mode. The 
algorithm scans the entire non-coding repeat-masked refer- 
ence genome and finds all the conserved binding sites above 
the scanning binding score Ss for the N first PWMs in the 
ranked list. The intergenic sequences of a given length (the 
default value is 1000 nt) are then scored according to their 
similarity to the training set in their content of PWM bind- 
ing sites (see 'Scoring intergenic sequences' in the Materials 
and Methods section). The closest the similarity in its motif 
content with the training set, the most likely an intergenic 
sequence is deemed to be functionally related to the training 
set. 

Application to mammalian developmental programs 

In order to assess Imogene performance on mammalian 
transcriptional regulation, we applied it to two sets of mam- 
malian specific CRMs that have previously been identi- 
fied starting from p300 Chip-seq data and functionally 
tested in a transient transgenic assay for activity in stage 
10 mouse embryo (11). We chose CRMs active in neu- 
ral tube and limb, as characterized in the VISTA website 
(http://enhancer.lbl.gov). For each developmental program, 
a subset of CRMs was visually selected for specificity and 
strength of expression in the tissue of interest from the pro- 
vided expression pattern. Among these selected sets, two 
limb CRMs and four neural tube CRMs contained no se- 
quence that could possibly be used to learn motifs by Imo- 
gene, due to its conservation requirements, either because 
of repeat masking or because of low conservation (see the 
Materials and Methods section). Elimination of these un- 
informative sequences produced curated training sets of 29 
neural and 39 limb CRMs (see 'Training sets' in the Mate- 
rials and Methods section). 

A cross-validation scheme was then used to measure 
Imogene predictability power (see 'Materials and Methods' 
for details). In brief, for each developmental program, the 
CRMs of the training set were divided into a learning set 
composed of 15 CRMs chosen at random, and a test set 
composed of the other CRMs used as True Positives. 

The learning set was used for motif generation by Imo- 
gene in its Genmot mode. This procedure was conducted for 
both evolutionary models using different values of the gen- 
eration parameter Sg and scanning threshold Ss to obtain 
the optimal values of these parameters for each model and 
each learning set (see Figure 2 and Supplementary Figure 
S3). 

The test CRMs of the training set were then ranked, using 
motifs generated on the learning set, against a 'background 
test set', a set of ~60 regions of 1 kb taken from the flanking 
sequences of the initial set of CRMs (see the Materials and 
Methods section). 

For different parameter sets, the test CRMs as well as the 
intergenic sequences of the background set were scored. The 
proportion of retrieved test set CRMs above a given score 
(True Positive Rate or TPR) was plotted against the propor- 
tion of appearing test background regions above the same 
score (FPR) as this score decreased to produce a so-called 
ROC curve (53). The ROC curves corresponding to differ- 
ent parameter values were then compared using the AUC, 
a quantity that is maximal at best prediction. Supplemen- 
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Figure 2. Analysis of well characterized developmental processes. We tested the algorithm on mammalian CRMs driving expression at El 1.5 in neural 
tube (A) and limb (B). For each class, CRMs were divided into a training set and a test set. Motifs were learned on the training set and used to score CRMs 
from the test set along with background regions consisting of the CRMs' 1 kb flanking sequences (see the Materials and Methods section). The displayed 
ROC curves show the proportion of test set CRMs recovered above a given score (True Positive Rate denoted by TPR) versus the proportion of recovered 
background sequence at the same score for the Felsenstein (F) and Halpern-Bruno (HE) models. The shown ROC plots are the results of 40 trials. The 
FPR < 1% region of each curve is replotted in the insets for better visibihty. For each test set and each evolutionary model, the thresholds Sg and Ss used 
for motif generation and sequence scanning are given in the figures. Black dashed lines show random discrimination. (C and D) Imogene ability to predict 
neural tube (C) and limb (D) CRMs is compared to that of the 'motif-blind' methods assessed in (22). A leave-one-out cross-validation scheme is used to 
assess the efficiency of the different methods to predict a bona fide CRM embedded in a 10 kb intergenic region, as described in (22) and in the Materials 
and Methods section. The box plot summarizes the distributions of sensitivities obtained using 10 different intergenic regions. The dashed line indicates 
the sensitivity value above which results have a probability smaller than 5% to be generated by chance. The results obtained with Imogene are shown when 
only sequences from a single species (human) is considered at the prediction stage (light brown) or when conservation between mammalian genomes is also 
used (dark brown). The different methods examined in (22) are denoted by numbers in the figure with the following correspondence with the terminology 
of (22): 1 = HexDiffrc, 2 = PAC.rc, 3 = HexYMF.s200.rc, 4 = HexMCD, 5 = D2z.cond.sl00, 6 = D2z.cond.mol.weights.rc and 7 = D2z.cond.weights. 
The HexDiff and HexYMF methods consider the words most associated in a statistical sense with the training set, and then use a weighted sum of word 
counts to score a given sequence. The PAC method (for Poisson Additive Conditional) also considers the words that are most associated with the training 
set, and then examines how over-represented each of these words is in the test sequence, relative to the assumed background, by computing a Poissonian 
/7-value. The HexMCD method trains separate fifth-order Markov chains on training modules and background sequences, and quantifies which model 
matches the test sequence better using a score proposed in (65). The three different variants of the D2z method compute dot products between the k-mer 
frequency distributions of training and test set sequences, and use their statistical significance {z score) as a scoring scheme. A fuller description of each 
method is provided in (22). 
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tary Figure S3 shows the AUC as a function of the num- 
ber of motifs N for different values of the scanning thresh- 
old Ss. One can see that the AUC increases quickly with 
the first five motifs generated, and has nearly converged to 
its maximum value when 10 motifs are kept. Therefore, we 
restricted ourselves to = 10 motifs, and constrained the 
other parameters using AUC maximization. Figure 2 shows 
the ROC curves obtained for the optimal parameters. They 
are seen to be similar for both models and both training 
sets. For the neural tube CRMs, 30% of the test set CRMs 
are retrieved at 1% FPR whereas an even larger proportion 
of 40% is obtained for the limb CRMs. The HB and the 
Felsenstein models are seen in Figure 2 to yield very simi- 
lar results in both cases. This standard procedure provides a 
test of the two modes of Imogene. Its success indicates that 
meaningful motifs were generated at the Genmot stage and 
that they were properly used at the Scangen stage to recog- 
nize the test CRM from background sequences. 

It should be noted that the test really provides only a 
lower estimate of Imogene success rate. Sequences of the 
background test set counted as 'False Positive' could, in re- 
ahty, be bona fide positive CRMs. 

One interesting feature of Imogene lies in its production 
of specific motifs. In our cross-validation procedure, differ- 
ent ranked lists of motifs were created for each randomly 
drawn test set. In order to provide a list of motifs generated 
by the algorithm, we ran Imogene on the full set of CRMs 
for each class. The corresponding 10 best motifs are shown 
in Figure S4. Figure S4 also shows the closest PWM to each 
motif in the TRANSFAC (48), JASPAR (49), HT-Selex (50) 
and UniPROBE (51) hst of motifs, as computed by Imogene 
PWM distance. Previously characterized motifs belonging 
to the considered developmental programs appear in each 
class (e.g. Oct/Pou TF family and NeuroD motif in the neu- 
ral CRMs). The motif content of each CRM is also pro- 
vided in Supplementary Figures S5 and S6. It is seen that 
the 10 best motifs appear on most CRMs of the training 
set. 

Among the existing algorithms, Kantorovitz et al (22) 
concluded that 'motif-blind' methods were the most suc- 
cessful for characterizing the specificity of a small set of 
CRMs. In order to further assess Imogene, we thus chose to 
compare its performance to that of these algorithms. Kan- 
torovitz et al (22) benchmarked the prediction of the algo- 
rithms they examined on eight mammalian CRM datasets 
promoting expression in diverse tissues. We quantified the 
ability of Imogene to characterize these different sets of 
CRMs using the cross-validation protocol of (22) in which 
the CRMs to be tested were compared to intergenic se- 
quences with similar GC content (see the Materials and 
Methods section). Conservation and phylogeny were used 
for the generation of motifs (i.e. in Imogene Genmot mode). 
The CRM scoring was performed both using conservation, 
the normal Scangen operating mode, and without it in or- 
der to provide a fair comparison between Imogene and the 
'motif-blind' methods that do not take advantage of conser- 
vation. The results are displayed in Figure 2 for the neural 
tube and limb CRM datasets and in Supplementary Figure 
S7 for the other CRM datasets. In all cases, Imogene is found 
to perform comparably to the 'motif-blind' methods (22). 
Without conservation, it appears nonetheless less efficient 



than the best 'motif-blind' methods, such as 'HexMCD'. 
The use of conservation significantly enhances Imogene pre- 
dicting power and actually makes it the top predictor for 
several datasets (5/8). The prediction of specific motifs is, of 
course, the interesting complementary feature of Imogene in 
both cases. 

Discrimination of tissue-specific CRMs in the mouse 

Given the ability of Imogene to distinguish specific CRMs 
from background sequences, we found it interesting to ap- 
ply it to the related but distinct task of distinguishing dif- 
ferent classes of CRMs. The question was previously con- 
sidered for D. melanogaster CRMs based on ChlP-seq data 
at different developmental time points (38), as detailed in 
the next section. It consists in learning features that distin- 
guish the CRMs of a given class from the CRMs of other 
classes in order to be able to predict the class of a newly ob- 
served CRM. The task differs from distinguishing CRMs 
from background intergenic sequences since motifs shared 
among different classes that , for instance, characterize the 
binding of generic CRM factors, are of no use for discrim- 
ination purposes.As a test case, we considered the neural 
tube and limb sets of mammalian CRMs used in the previ- 
ous section. Given the nature of the task, we selected in each 
set the CRMs with an expression that appeared mostly re- 
stricted to neural tube and limb. This yielded 12 neural and 
15 limb CRMs. 

As in (38), we used a LOOCV scheme in which the learn- 
ing set constituted all but one of the elements of a class, 
the remaining one being used as a test sequence. The pro- 
cess can be summarized as follows. We call the class of in- 
terest the positive class and the classes against which we 
wish to learn the negative classes. The LOOCV process be- 
gins with the exclusion of a (positive or negative) CRM that 
serves as an unobserved test CRM. Then, a set of N mo- 
tifs is learnt on the remaining CRMs of each class, yield- 
ing positive and negative motifs. These motifs are used to 
build a simple linear classifier based on a weighted score giv- 
ing positive (resp. negative) contributions to positive (resp. 
negative) motifs (see the Materials and Methods section). 
Finally, the test CRM is ranked among all CRMs by the 
build classifier and this rank is registered. A successful clas- 
sification would rank positive CRMs on top of the hst and 
attribute worse ranks to negative CRMs. Therefore, after 
processing all CRMs, the list of ranks for the positive and 
negative CRMs is represented as a ROC curve indicating the 
TPR and FPR for increasing rank. This serves to optimize 
the different parameters (the threshold for motif generation 
Sg, the threshold for sequences scanning Ss, and the num- 
ber of motifs N used to score sequences) by maximizing the 
AUC for a FPR < 0.2. 

The results are shown in Figure 3. We focus on the results 
obtained with the HB evolutionary model. Results (motifs 
and thresholds) are very comparable with the Felsenstein 
model. Motifs are shown on the right of the ROC plots and 
were generated on the positive classes with optimal param- 
eters. The two classes were optimally discriminated using 
only two motifs in each class, with specificities ^^=11,^^ = 
8, comparable to that found in the learning task of the pre- 
vious section. The closest motifs in the TRANSFAC (48) 
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Figure 3. Pattern recognition (mammals). ROC plots showing the discrimination between limb and neural CRMs using a simple linear classifier. Neural 
and limb classes are compared to each other. Thick lines correspond to a leave-one-out cross-vahdation (LOOCV) scheme with a score function based on 
the de novo generated motifs from Imogene. The results obtained with the two evolutionary models are shown (Felsenstein model (F), solid dark red line, 
with threshold parameters = 11, = 9; Halpern-Bruno (HB) model, solid light red line, with threshold parameters = 1 1, = 8). The analogous 
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lines). With this latter procedure, the discrimination is improved but is still comparable to that computed by the LOOCV, indicative of no strong overfitting 
of the training set. The corresponding discriminative motifs are shown for the whole training set learning with HB model (similar motifs are obtained with 
the F model). Black dashed line shows random discrimination. 



and JASPAR databases (49), as well as in the HT-Selex ex- 
periment list (50) and UniPROBE database (51), are shown 
in Supplementary Figure S8. The best ranking motif of the 
neural CRMs is found to be unequivocally associated with 
the TRANS FAC Oct/Pou Transcription Factor known to 
be involved in the neural tube formation (54). 

Discrimination of Dvosophila tissue-specific CRMs 

In order to further test the discriminating power of Imo- 
gene de novo generated motifs, we applied it to the CRM 
classification task reported in (38). In this work, previously 
characterized D. melanogaster CRMs were divided into five 
classes corresponding to the different tissue types in which 
they were active: mesoderm (meso), somatic muscle (SM), 
visceral muscle (VM), mesoderm and somatic muscle (meso 
& SM) and visceral and somatic muscle (VM & SM). Zinzen 
et al (38) made use of a collection of Chip-seq binding data 
for different factors and at different developmental time 
points to attribute to each CRM a total of 15 peak height 
values. It was then tested whether classical machine learn- 
ing techniques could be used to discriminate the different 
CRM classes on the basis of these extensive data. This was 
indeed found possible with a high success rate in a stan- 
dard cross-vahdation scheme: CRMs predicted to belong 
to a given class with a probability higher than 95% were in- 
deed found to belong to that class with a high success rate 
of 80%. 

This led us to wonder whether Imogene would succeed in 
classifying these different CRMs without using any bind- 
ing data, but rather on the basis of combinations of de novo 
motifs that it would itself generate. We used the set of well- 
characterized CRMs belonging to five different classes as- 
sembled in (38). We then proceeded as in the previous case 
of mammahan CRMs. 

Imogene results are shown together with the machine 
learning results of (38) in Figure 4. For clarity, we here show 



results obtained with the Felsenstein model. Results ob- 
tained with the HB model are comparable. Strikingly, with- 
out any binding data, Imogene prediction rates are compa- 
rable to the machine learning ones in the specificity range 
(FPR < 5%) used for CRM prediction in (38). Its perfor- 
mance is even better for the Meso and SM classes at high 
score. The latter case is of particular interest. The machine 
learning algorithm essentially used Mef2 ChlP-seq peak 
heights to predict SM CRMs, resulting in an incorrect clas- 
sification at high scores since this TF is required for the dif- 
ferentiation of all muscle types. However, the use of the spe- 
cific Mef2 motif obtained de novo from the SM training set 
allows one to restore a correct classification at high score 
(Figure 4C). 

On the side of each ROC plot, the de novo motifs gener- 
ated on the whole training set are displayed. The number of 
motifs shown is the optimal number used for CRM scoring 
in the leave-one-out cross-validation. Among the generated 
motifs, one can recognize 4/5 TFs for which ChlP-seq data 
were used in (38), namely Twist (motif 2, meso & SM), Mef2 
(motif 1, SM), Bin and Tin (motifs 1 and 2, VM). The Bap 
motif was not found by the algorithm and correspondingly 
it was not shown to be of importance in (38). 

In summary, our analysis indicates that Imogene not 
only determines de novo functionally relevant binding sites 
within a set of CRMs but can also be used to identify the 
more subtle differences in binding sites that underhe func- 
tional differences between related sets of CRMs. 

Web interface 

The ensemble of developed statistical tools and the allied 
computer codes are freely available at http://github.com/ 
hrouault/Imogene. In addition, they can be used through 
a user-friendly web interface (http://mobyle.pasteur.fr/ 
cgi- bin/portal. py#forms::imogene) that provides motif and 
CRM predictions for the community. This interface is pow- 
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Figure 4. Pattern recognition (Drosophilae). Recognition of classes of CRMs expressed in five tissue types: mesoderm (meso), somatic muscle (sm), visceral 
muscle (vm) , mesoderm and somatic muscle (meso & sm) and visceral and somatic muscle (vm & sm). ROC plots are obtained using a leave-one-out cross- 
validation scheme. Two classifiers are compared: a Support Vector Machine using 15 ChlP-on-chip peak heights (gray, replotted using the data and the 
program provided in (38)), and Imogene using the de novo generated motifs with Felsenstein evolutionary model (red) and a simple linear classifier (see the 
Materials and Methods section). The following thresholds were used: meso (Sg = 12, Ss = 12), meso & sm (Sg = 10, = 10), sm {Sg = 9, Ss = 4), vm (Sg 
= 10, Ss = 10) and vm & sm (Sg =11,5', = 8). 



ered by the Pasteur Institute Internet server through the 
mobyle framework (55). The input web page and an exam- 
ple output web page are shown in Figures 5 and 6 respec- 
tively. 

The input form (see Figure 5) is divided into several sec- 
tions. One of the two available algorithm modes should be 
chosen at start: 

• Genmot: given a Hst of coordinates of typically 15 en- 

hancers of 1 kb (training set), generates de novo motifs 
ranked by their score (Pl(m) in the Materials and Meth- 
ods section). 

• Scangen: given the previously generated motifs, produces a 

list of genome-wide predicted CRMs with conserved bind- 



ing sites. The rank of a CRM is based on a Poissonian 
score that takes into account the CRM content in motifs 
(as described in the Materials and Methods section) 

The group of species considered should also be speci- 
fied. The algorithm can be used on Drosophilae (with ref- 
erence species D. melanogaster) or mammals (with refer- 
ence species M musculus). The different algorithm param- 
eters such as the sought motif width, threshold specificity 
for binding sites or allowed position shifts between differ- 
ent species (see the Materials and Methods section for a de- 
tailed description) are set by default to values that have been 
found to provide reasonable results. They can be modified 
by the user to optimize the results for other training sets. 
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Figure 5. Web-based interface: input web page. A copy input web page for 
Imogene powered by the mobyle bioinformatics framework is shown. 



In mode Genmot, the user should enter the training set 
CRM coordinates. The chosen evolutionary model for the 
TFBS should also be specified. The Felsenstein mode is 
computationally faster than the HB one. The results of the 
two modes have been found to be comparable (see Figures 
2 and 3). 

In mode Scangen, the algorithm scores and ranks inter- 
genic sequences in the reference species, using a list of mo- 
tifs, as described in the first 'Results' section and in 'Mate- 
rials and Methods'. The list of de novo Genmot motifs can 
be used as input. The user can set the length of the ranked 
sequences (1 kb is the default value) and the number of scor- 
ing motifs (5 is the default value). The default values have 
been chosen for computational efficiency but changes can 
improve results (see Supplementary Figure S3). 

An example of Imogene output is displayed in Figure 6. 
The Genmot mode creates from the provided training set a 
list of ranked motifs together with their significance and 
over-representations (see the Materials and Methods sec- 
tion). The positions of these motifs on the CRM of the 
training set and on their homologous sequences in other 
species are also provided, as illustrated in Figure 6A for two 
motifs. Figure 6B shows the output of the Scangen mode for 
these two motifs. The ordered list of best-ranking intergenic 
sequences is given together with information on the closest 
TSSs. 



DISCUSSION 

We have presented Imogene, a set of statistical tools and a 
computer software able to predict de novo relevant motifs 
in a moderate size set of functionally related CRMs and 
able to infer novel CRMs with a low FPR in both Drosophi- 
lae and mammalian genomes. In contrast to methods ded- 
icated to the general discovery of CRMs (see 2156 for re- 
cent reviews and assessment), the task requires to extract 
specific information from a training set that by itself offers 
only weak statistically discriminative power. This challenge 
was previously tackled by 'motif-blind' methods (22) that 
aimed to characterize the statistical distribution of short 
sequences in the training without providing information 
on specific TFBS. Imogene makes use of a different strat- 
egy to work efficiently from a CRM set of modest size. It 
systematically exploits the information available in multi- 
ple sequenced genomes with a mode of motif inference that 
makes intrinsic use of quantitative models for binding site 
evolution. This leads it to achieve a performance for CRM 
identification comparable to 'motif-blind' algorithms (22) 
or even superior when motif conservation between differ- 
ent species is used as the CRM identification stage. But, 
in contrast to 'motif-blind' methods, Imogene also provides 
specific information on TFBS, an element of particular bi- 
ological interest as well as a crucial ingredient for further 
biological tests of bioinformatic predictions. 

Imogene relies on conservation between different species 
both as filtering step and to enlarge its training set. Phylo- 
genetic conservation between multiple sequenced genomes 
has previously been shown to provide useful information 
on c/^-regulatory motifs (57-59). Although many binding 
sites are not conserved (60), methods that use conserva- 
tion among multiple genomes were found superior to sin- 
gle genome methods in a recent assessment of methods de- 
voted to general CRM prediction (56). A simple peak phast- 
Cons score (61) was in fact found to be surprisingly efficient 
(56). In addition, ultraconservation has been found to reh- 
ably point out functional CRMs (62) in transgenic assays 
although subsequent deletion of these CRMs did not result 
in a marked phenotype (63) perhaps because of redundancy 
or too crude experimental assays. 

Phylogenetic conservation, however, cannot per se ad- 
dress the question of specific spatio-temporal expression. 
The necessary information is provided to Imogene by the 
training set of CRMs with well-characterized expression. 
Imogene aim is to extract it optimally by making full use 
of several sequenced genomes, instead of focusing on a sin- 
gle genome (32) analysis, simply comparing the reference 
genome with another (64-66) or simply adding orthologous 
sequences (67). Similarly to the Monkey algorithm of (28) 
Imogene uses a model for the evolution of motif binding 
sites to properly weigh this additional information. The two 
algorithms are however complementary since Imogene cre- 
ates de novo motifs from the training set while Monkey tests 
already well-characterized binding motifs. 

The algorithm that lies at Imogene core was previously 
applied to gene co-regulation in Drosophila (14). Motifs 
predicted to be important for sensory-organ-precursor de- 
velopment were confirmed by site-directed mutagenesis. A 
significant fraction of top predicted new CRMs based on 
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Figure 6. Web-based interface: output web page. Example of an output web page for Imogene powered by the mobyle bioinformatics framework. (A) 
Result page for the Genmot mode. Two motifs were generated from the neural tube full training set (default is five) using the same parameters as in Figure 
2. Results are shown for the training set sequence MRPS9 (intragenic). For display purposes, the beginning of the sequence, which contains no instances 
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mode. The two generated motifs were used to score putative regulatory sequences of 1 kb in the mouse genome at optimal threshold Ss = 10. The five best 
ranking sequences are shown (default is 200). 
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this predicted motifs were also shown to direct expression 
in SOP or more generally in the peripheral nervous sys- 
tem. The interest of predicting motifs de novo was further 
illustrated by a subsequent application of the algorithm 
to epidermal morphogenesis and trichome development in 
Drosophila (68). The algorithm provided a refined PWM for 
the master regulator Ovo/Shavenbaby and predicted as well 
a functionally important novel motif 

In spite of its successful application to gene co-regulation 
in Drosophila, it was not clear that the method could be suc- 
cessfully extended to decipher c/^-regulatory information in 
the notoriously more difficult case of mammalian gene ex- 
pression. We have provided here bioinformatics evidence 
that our developed algorithm indeed provides meaningful 
results in this case also. Imogene was shown to successfully 
recognize CRMs belonging to neural and limb development 
programs solely based on motifs that it has constructed de 
novo from the analysis of other CRMs. Furthermore, the 
created PWMs appear to comprise both known and new 
motifs, in strong analogy with the previous studied cases in 
the fly. 

There are currently numerous cases for which a small 
number of CRMs belonging to the same program of gene 
expression has been characterized. At the same time, a large 
number of PWMs remain to be found. This is even more 
the case for CRMs. Therefore, the use of Imogene with its 
de novo motif building ability and allied CRM identification 
should provide helpful service to the community. 

We have further shown that Imogene can discriminate 
between classes of CRMs, a capability that is clearly dis- 
tinct from general CRM prediction (56). In this task, 
Imogene should usefully complement ChlP-seq data that 
are currently obtained for many developmental programs. 
Whereas ChlP-seq provides information on the binding of 
already known factors, Imogene is able to propose new mo- 
tifs and helps to identify new involved DNA-binding co- 
factors and their binding sites. We anticipate that Imogene 
CRM discriminative ability is likely to be important for fu- 
ture studies of transcription regulation specificity in closely 
related cell types (e.g. different neuronal cell types) since 
even large-scale studies will probably not provide more than 
a few tens of differentially activated CRMs, the training set 
size targeted by Imogene. 

We thus believe that Imogene is a useful addition to ex- 
isting algorithms and softwares (32). We hope that it will 
serve as a helpful and timely tool in the difficult decipher- 
ing of gene regulation in higher eukaryotes. 
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